Writing functions in R

StratoBayes workshop, 18th - 20th March 2024, Durham, UK

Kilian Eichenseer, Durham University

Function basics

What is a function?

takes input -> does something -> returns output

mean(c(1, 2, 3))
[1] 2


A function needs a name, arguments in (), and a body in {}


subtract <- function(arg1, arg2) { 
  arg1 - arg2 
} 


subtract(2, 1)
[1] 1

Why do we need functions?

  • Readability
  • Organisation
  • Modularity
  • Reusability

Imagine calculating the mean without standard functions like mean or sum:

  data <- c(1,2,3)
  total <- 0
  count <- 0
  for (value in data) {
    total <- total + value
    count <- count + 1
  }
  total/count
[1] 2
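Wrapping that loop in a function makes it reusable. A sketch (the name my_mean is our own, and base R's mean() remains the better choice in practice):

```r
# reusable version of the loop above -- illustrative only
my_mean <- function(data) {
  total <- 0
  count <- 0
  for (value in data) {
    total <- total + value
    count <- count + 1
  }
  total / count
}
my_mean(c(1, 2, 3))
# [1] 2
```

Now the same logic can be applied to any numeric vector without repeating the loop.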

Arguments

Arguments need to be provided in the correct order, or specified by name:

subtract(2, 1)
[1] 1


subtract(1, 2)
[1] -1


subtract(arg2 = 1, arg1 = 2)
[1] 1

Default values

Defaults make functions more convenient to use and can hide complexity.

subtract(2)
Error in subtract(2): argument "arg2" is missing, with no default


This will work if we set a default for arg2:

subtract <- function(arg1, arg2 = 1) {
  arg1 - arg2
}


subtract(2)
[1] 1


Ellipsis (‘…’)

Additional, optional arguments can be allowed by using ‘…’ as the last argument:

my_plot <- function(arg1, arg2, ...) {
  plot(arg1, arg2, ...)
}


my_plot(2, 1, col = "red", pch = 17, cex = 2)
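Inside a function, the additional arguments can also be captured with list(...) and inspected. A small sketch (show_dots is a made-up name):

```r
# capture the additional arguments and return their names
show_dots <- function(...) {
  args <- list(...)  # ... becomes a named list
  names(args)
}
show_dots(col = "red", pch = 17)
# [1] "col" "pch"
```

This is useful when a function needs to react to which optional arguments the user supplied.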

Return

A function generally should return something, but this one prints nothing when called (an assignment as the last expression returns its value only invisibly):

subtract <- function(arg1, arg2) {
  result <- arg1 - arg2
}
subtract(2,1)


Return explicitly with return(), or place the return value as the last expression of the function:

subtract <- function(arg1, arg2) {
  result <- arg1 - arg2
  return(result)
}
subtract(2,1)
[1] 1


Return multiple objects

just_return <- function(arg1, arg2) {
return(arg1)
return(arg2)
}
just_return(2, 1)
[1] 2

This did not work as intended: the first return() exits the function immediately, and R functions only return a single object. Instead, combine the values in a vector, list, or other data structure:

just_return <- function(arg1, arg2) {
return(c(arg1, arg2))
}
just_return(2, 1)
[1] 2 1
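If the values should stay separate, or have different types, a named list works too (an illustrative sketch with a name of our own):

```r
# a named list keeps the two values separate and can mix types
just_return_list <- function(arg1, arg2) {
  list(first = arg1, second = arg2)
}
res <- just_return_list(2, "a")
res$first
# [1] 2
res$second
# [1] "a"
```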

Exercise 1a - Building a function

Write a function that adds two values and creates a scatter plot with the first value on the x-axis and the result on the y-axis.

Hint

Click to expand/collapse First, we add the two values and save the result in an object within the function. Then, we use the plot function to create a scatter plot with x being the first value, and y being the object we created with the addition.

Solution

Click to expand/collapse
add_and_plot <- function(a, b) {
  result <- a + b
  plot(a, result)
}

Exercise 1b - Expanding a function

Expand the function you just created so that it returns the result after plotting.

Hint

Click to expand/collapse We can place the object we want to return at the end of our function, or use the return function.

Solution

Click to expand/collapse
add_and_plot <- function(a, b) {
  result <- a + b
  plot(a, result)
  result
}

Exercise 1c - Passing additional arguments

Expand the function further by allowing customisation of the plot by changing graphical parameters.

Hint

Click to expand/collapse We can use ellipsis (...) as a function argument in the outer and the inner function to allow for additional arguments being passed.

Solution

Click to expand/collapse
add_and_plot <- function(a, b, ...) {
  result <- a + b
  plot(a, result, ...)
  result
}

Binary operators

Binary operators

Standard function syntax:

sum(c(1,2))
[1] 3

Operator syntax:

1 + 2
[1] 3

Some binary operators, like %in% or %%, are enclosed in %:

3 %in% c(1,2,3)
[1] TRUE


Custom binary operators – let’s define an operator for “not in”:

`%!in%` <- function(x, y) !(x %in% y)
3 %!in% c(1,2,3)
[1] FALSE
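The same pattern works for any pairwise operation; for instance, a string-concatenation operator (the name %+% is our own invention):

```r
# a custom operator for string concatenation
`%+%` <- function(a, b) paste0(a, b)
"Strato" %+% "Bayes"
# [1] "StratoBayes"
```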

Control structures

Control structures – if

if a condition is true, do something.

if (1 + 1 == 2) print("True")
[1] "True"


add_or_subtract <- function(arg1, arg2, operation) {
 if (operation == "add") {
   result <- arg1 + arg2
 }
 if (operation == "subtract") {
   result <- arg1 - arg2
 }
 result
}
add_or_subtract(2,1,"add")
[1] 3

Control structures – else

else instructs what to do when the if condition is not met.

if (1 + 1 == 3) print("True") else print("False")
[1] "False"


add_or_subtract <- function(arg1, arg2, operation) {
 if (operation == "add") {
   result <- arg1 + arg2
 } else {
   result <- arg1 - arg2
 }
 return(result)
}
add_or_subtract(2,1,"subtract")
[1] 1

Control structures – switch

Instead of many if and else statements, try switch:

fossil_description <- function(fossil) {
 switch(fossil,
  ammonite = "coiled shell",
  Tyrannosaurus = "serrated teeth",
  Lepidodendron = "scaly bark", 
  "not a fossil"
 )
}
fossil_description("Tyrannosaurus")
[1] "serrated teeth"
fossil_description("laptop")
[1] "not a fossil"

Control structures – for loops

Loops are used for repeating similar actions multiple times. for loops iterate over a set of values. The iterator (i) changes with every iteration of the loop:

for (i in c(1,2,3)) print(i)
[1] 1
[1] 2
[1] 3

To generate sequences of integers, we can use seq_len. Let’s make a function:

print_repetitions <- function(n) {
 for (i in seq_len(n)) { 
   print(i)
 }
}
print_repetitions(2)
[1] 1
[1] 2
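seq_len(n) is safer than 1:n because 1:n counts downwards when n is 0. A quick demonstration:

```r
n <- 0
1:n         # [1] 1 0 -- counts downwards, a loop body would run twice
seq_len(n)  # integer(0) -- empty, a loop body is skipped
for (i in seq_len(n)) print(i)  # prints nothing
```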

Control structures – while loops

while loops repeat a task until a condition is no longer met.

add_until_4 <- function(x) {
  while(x < 4) {
    x <- x + 1
    print(x)
  }
}
add_until_4(1)
[1] 2
[1] 3
[1] 4

Exercise 2a - Building a loop

Create a for loop that calculates the mean of all numerical columns of the Talat_isotopes dataset. If the dataset is not already loaded, we can read it with:

# read the "Talat_n_Yissi_isotopes.csv" from session 3:
Talat_isotopes <- read.csv("data/Talat_n_Yissi_isotopes.csv", 
                           header = TRUE)

Hint 1

Click to expand/collapse

To check if a column is numeric, we can use is.numeric on the column. Using if and possibly else allows us to do different operations depending on the outcome of is.numeric.

Selecting a column can be done with square brackets, e.g. Talat_isotopes[ ,1] selects the first column.

To get the number of columns of Talat_isotopes, we can use the ncol function.

Hint 2

Click to expand/collapse Start a loop for example with for (i in seq_len(ncol(Talat_isotopes))) { } to run one iteration per column. Everything inside the curly brackets is done exactly the same in each iteration, except i changes. To print the output at every iteration of the loop, we can use the print function.

Solution

Click to expand/collapse
# save number of columns
nCol <- ncol(Talat_isotopes)
# run loop for nCol iterations
for (i in seq_len(nCol)) {
  if(is.numeric(Talat_isotopes[,i])) {
    print(mean(Talat_isotopes[,i]))
  }
}
[1] -0.335762
[1] -10.78194
[1] 411.5718
[1] 519.5261
[1] 205.5268

Exercise 2b - Turn the loop into a function

From the loop created in exercise 2a, create a function that takes the Talat_isotopes dataframe (or any other dataframe) as input and returns the means of numerical columns as output.

Hint

Click to expand/collapse We can return the output by saving the result of every iteration of the loop in an object, and then return that object. We can omit NA output by indexing with !is.na.

Solution

Click to expand/collapse
numericColumnMean <- function(df) {
  # save number of columns
  nCol <- ncol(df)
  # create a vector for the output
  output <- rep(NA,nCol)
  # run loop for nCol iterations
  for (i in seq_len(nCol)) {
    if(is.numeric(df[,i])) {
      output[i] <- mean(df[,i])
    } else {  # NA for non-numeric columns:
      output[i] <- NA 
      }
  }
  # return output that is not NA
  output[!is.na(output)]
}
# apply function
numericColumnMean(Talat_isotopes)
[1]  -0.335762 -10.781942 411.571816 519.526138 205.526785

Apply and similar functions

apply

  • apply and related functions apply a function to elements of arrays, lists, …
  • more concise than loops and don’t clutter the global environment

For example, to get the classes of the first three columns of the Talat_isotopes data:

Talat_classes <- rep(NA,3)
for (i in 1:3) {
  Talat_classes[i] <- class(Talat_isotopes[,i])
}
Talat_classes
[1] "character" "numeric"   "numeric"  

Or with apply:

apply(Talat_isotopes[,1:3], MARGIN = 2, FUN = class)
     sample        d13C        d18O 
"character" "character" "character" 

apply

Let’s have a look what happened there. apply(X, MARGIN, FUN) takes an array X (our Talat_isotopes dataframe), and applies a function (FUN) to elements of that array, specified by MARGIN.

MARGIN = 2 indicates columns, MARGIN = 1 would be rows. So here we applied the class function to every column of Talat_isotopes.

apply simplifies the output, so here it returned a vector with one element for each column.

lapply

lapply is similar to apply but takes list or vector input. It returns a list with one element per element of the input.

data <- list(c(1, 2, 3), c(4, 5))
lapply(data, mean)
[[1]]
[1] 2

[[2]]
[1] 4.5

sapply does the same, but also tries to simplify the output; here we get a vector:

data <- list(c(1, 2, 3), c(4, 5))
sapply(data, mean)
[1] 2.0 4.5

vapply

vapply is a safer version of sapply: it requires the user to specify the anticipated class and length of the elements of the output:

data <- list(c(1, 2, 3), c(4, 5))
vapply(data, FUN = mean, FUN.VALUE = numeric(1))
[1] 2.0 4.5

Here, we specified in FUN.VALUE that each element of the output should be a numeric of length 1.
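To see that safety net in action: requesting the wrong type makes vapply stop with an error rather than return a surprising result (a sketch; we catch the error here only so the example keeps running):

```r
data <- list(c(1, 2, 3), c(4, 5))
# correct specification: one numeric value per element
vapply(data, FUN = mean, FUN.VALUE = numeric(1))
# [1] 2.0 4.5
# a wrong specification is caught instead of silently coercing:
caught <- tryCatch(vapply(data, FUN = mean, FUN.VALUE = character(1)),
                   error = function(e) "vapply stopped with an error")
caught
# [1] "vapply stopped with an error"
```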

Exercise 3a - Use apply to calculate column means

In exercise 2, we built a function that calculated the mean of all numerical columns of the Talat_isotopes dataset and returned it.

Can we achieve the same using apply?

This task is trickier than it looks: apply first coerces the dataframe to a matrix, so all columns become type character as soon as there is even one character column. We can add as.numeric to try and force columns into numeric. Better yet, we just select the columns containing values that can be coerced to numeric:

numeric_columns <- c("d18O", "d13C", "m", "age", "m.my")
Talat_numeric <- Talat_isotopes[ ,numeric_columns]

Hint 1

Click to expand/collapse

The second argument of apply, MARGIN, needs to be set to 2 for columnwise operations.

In the third argument, we can define a function, for example function(col) { } will modify each column according to the instructions in curly brackets, referring to the column as col.

Solution

Click to expand/collapse
apply(Talat_numeric, 2, function(col) {
  mean(col)
})
      d18O       d13C          m        age       m.my 
-10.781942  -0.335762 411.571816 519.526138 205.526785 

Or simply:

apply(Talat_numeric, 2, mean)
      d18O       d13C          m        age       m.my 
-10.781942  -0.335762 411.571816 519.526138 205.526785 

Exercise 3b - Reading a messy dataset

The Talat_isotopes and Talat_elements datasets are contained in a multi-sheet Excel file from the supplementary materials of Maloof et al. (2010). It is in the data folder and named Maloof_et_al_2010_SM.xls.

Reading such files with R can be a challenge and it is often easier to copy-paste the relevant data into clean spreadsheets and save them as .csv files.

However, here we try to automate the reading of sheets from this particular file, starting with the sheet Morocco-Talat n' Yissi.

We start by installing and loading the readxl package.

# Install and load the readxl package if you haven't already
if (!require(readxl)) install.packages("readxl")
library(readxl)

We notice that the Morocco-Talat n' Yissi sheet has two header rows, and the formatting is messy.

We read in the header rows of the sheet 4. Morocco-Talat n' Yissi:

# specify the file path
file_path <- "data/Maloof_et_al_2010_SM.xls"
# specify the sheet name
sheet_name <- "4. Morocco-Talat n' Yissi"
# read the top two rows
# trim_ws = FALSE stops whitespace-only cells from being dropped
# col_names = FALSE stops cells being used as column names
header_rows <- read_excel(file_path, sheet = sheet_name, n_max = 2,
                          trim_ws = FALSE, col_names = FALSE)

Next, we merge the header rows as we want only a single column name per column:

column_names <- apply(header_rows, 2, function(x) paste(na.omit(x), collapse = "_"))

Finally, we use these column names for reading the entire data set, skipping the first two problematic rows:

talat <- read_excel(file_path, sheet = sheet_name, skip = 2, col_names = column_names)

The result looks reasonable:

head(talat[ ,c(1:4)])
# A tibble: 6 × 4
  `0_sample`  d13C   d18O d13Corg
  <chr>      <dbl>  <dbl> <lgl>  
1 M250-2.1   -2.07  -7.95 NA     
2 M250-2.6   -1.78 -10.7  NA     
3 M250-3.6   -2.04 -10.9  NA     
4 M250-4.4   -2.34  -9.53 NA     
5 M250-6.1   -2.04  -8.92 NA     
6 M250-7.5   -1.59  -6.84 NA     

Exercise: Can you build a function from the above and apply it to other sheets of the Maloof_et_al_2010_SM.xls file? We can also build a second function that uses this function to read multiple sheets at once.

Hint

Click to expand/collapse

We can almost copy-paste the lines we used to read the sheet earlier and collate them in a function.

We need two arguments: the file_path, and the sheet_index.

lapply is useful if we want to use that function on multiple sheets at once.

Solution

Click to expand/collapse

Here is a function that reads a single sheet:

read_sheets_Maloof <- function(file_path, sheet_index) {
  # read the top two rows
  header_rows <- read_excel(file_path, sheet = sheet_index, n_max = 2,
                            trim_ws = FALSE, col_names = FALSE)
  # merge to column names
  column_names <- apply(header_rows, 2, function(x) {
    paste(na.omit(x), collapse = "_")
  })
  # read the rest of the file
  output <- read_excel(file_path, sheet = sheet_index, 
                       skip = 2, col_names = column_names)
  output
}

Using lapply, we can build a simple function that uses the previous function to read multiple sheets at once and saves them as a list of dataframes. Here, we use the indices of the sheets, rather than the names, so we don’t have to type out the complicated sheet names.

multi_sheets_Maloof <- function(file_path, sheet_index, 
                                read_function) {
  # do this for every sheet index (integers)
  lapply(sheet_index, function(s) {
    read_function(file_path = file_path, sheet_index = s)
  })
}

From here, you could try to combine the dataframes, e.g. using merge.

Environment and scoping

Environment

An environment can be conceptualised as a place where objects with a name and a value exist.

For example, when we run

x <- 1

x now exists in the global environment, and has the value 1.


Each function call creates its own environment. Note that for loops do not: variables assigned inside a loop live in the surrounding environment.

Environment

If we run the following function to assign to b the value of a

a_to_b <- function(a) b <- a
a_to_b(a = 1)

and then look for b in the global environment

b
Error in eval(expr, envir, enclos): object 'b' not found

we get an error because b only existed within the function environment.


More on environments: adv-r.hadley.nz/environments.html

Scoping

R uses scoping rules to look for variables (or functions). If a variable is not found in a function environment, R looks in the parent environment (i.e. the environment in which the function was created).

x <- 1
double_x <- function(y) 2 * x + y
double_x(0)
[1] 2

x is a free variable in the double_x function – it is not supplied to or defined in the function. Instead, it’s looked up in the environment where double_x was created, the global environment.

This can get tricky, see here for more details: bookdown.org/rdpeng/rprogdatascience/scoping-rules-of-r.html
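Lexical scoping also enables closures: a function can keep using, and updating, variables from the environment in which it was created. A sketch (make_counter is a made-up name; <<- assigns into that enclosing environment):

```r
# a counter that remembers its own count between calls
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1  # <<- assigns in the enclosing environment
    count
  }
}
counter <- make_counter()
counter()
# [1] 1
counter()
# [1] 2
```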

Exercise 4 - Use vapply to return the three highest d13C values for every stage in Talat_isotopes

As a starting point, extract the unique stage names from the Talat_isotopes dataframe using the unique function:

unique_stages <- unique(Talat_isotopes$stage)

Hint

Click to expand/collapse

We can use this as our first argument to vapply, and use subset to select only the values of d13C that are from a specific stage.

The sort function can be useful to sort the d13C values in decreasing order, for example:

# function to return three largest elements of a vector
three_largest <- function(x) {
  # sort vector in decreasing order
  x <- sort(x, decreasing = TRUE)
  # return the first three values
  x[1:3]
}

Instead of passing Talat_isotopes as an argument, we can let R look it up in the global environment, where it was defined.

The third argument of vapply needs to specify the class and length of the output. In our case, it will be three numeric values: numeric(3)

Solution

Click to expand/collapse

The first solution takes advantage of R functions being able to find variables that were defined outside the function environment:

# build a custom function for vapply
three_largest_stages <- function(s) {
  # subset to only include rows from the current stage
  # note: we haven't passed Talat_isotopes as a function argument
  Talat_isotopes <- subset(Talat_isotopes, stage == s)
  # extract three largest values
  three_largest(Talat_isotopes$d13C)
}
# put everything together
vapply(unique_stages, FUN = three_largest_stages, 
       FUN.VALUE = numeric(3))
        T    A    B
[1,] 0.74 3.46 1.79
[2,] 0.37 3.44 1.78
[3,] 0.34 3.43 1.76

vapply has returned the output as a matrix.

The function above works because Talat_isotopes is accessed from the global environment, but only modified within the function environment.

Caution: this can lead to confusion, and it is often better to avoid it.

A cleaner solution would be to pass Talat_isotopes as a second argument to the three_largest_stages function.

Testing and Debugging

Error handling

Very helpful in complex functions

subtract(2, 1)
[1] 1
subtract("2", 1)
Error in arg1 - arg2: non-numeric argument to binary operator

Check that input is correct and display custom error messages:

subtract <- function(arg1, arg2) {
  if (!is.numeric(arg1)) {
    stop("`arg1` must be numeric")
  }
  arg1 - arg2
}
subtract("2", 1)
Error in subtract("2", 1): `arg1` must be numeric
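For simple checks like this, base R's stopifnot() is a compact alternative. A sketch of the same guard (subtract_checked is our own name, to leave subtract untouched):

```r
subtract_checked <- function(arg1, arg2) {
  stopifnot(is.numeric(arg1), is.numeric(arg2))  # errors if any check fails
  arg1 - arg2
}
subtract_checked(2, 1)
# [1] 1
# subtract_checked("2", 1) now stops with: is.numeric(arg1) is not TRUE
```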

tryCatch

Use it if you anticipate an error but want the function to continue.

Let’s try to generate data from a multivariate normal distribution:

mvnfast::rmvn(1, mu = c(0,0), sigma = matrix(rep(1,4), 2))
Error in eval(expr, envir, enclos): chol(): decomposition failed

mvnfast::rmvn is fast but fails for some problematic sigma values. In case it fails, we use MASS::mvrnorm instead:

my_rmvn <- function(n, mu, sigma) {
  tryCatch(mvnfast::rmvn(n, mu, sigma),
           error = function(e) MASS::mvrnorm(n, mu, sigma))
}
my_rmvn(n = 1, mu = c(0,0), sigma = matrix(rep(1,4), 2))
[1] 0.7983643 0.7983643
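tryCatch() can also intercept warnings, and a finally clause runs regardless of the outcome. A sketch (safe_log is a made-up name):

```r
safe_log <- function(x) {
  tryCatch(log(x),
           warning = function(w) NA_real_,          # e.g. log() of a negative number warns
           finally = message("safe_log finished"))  # finally always runs
}
safe_log(-1)  # the "NaNs produced" warning is caught, NA is returned
safe_log(10)  # no warning, the ordinary result is returned
```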

Traceback

If something went wrong, find out where using traceback():

my_rmvn(n = 1, mu = 0, sigma = matrix(rep(1,4), 2))
Error in MASS::mvrnorm(n, mu, sigma): incompatible arguments
traceback()

Break points

Break points allow you to look inside your function’s environment. This works best when the function is in a separate script. Click to the left of a line of code in your function to activate a break point (a red dot appears):

Now run the function…

my_rmvn(n = 1, mu = c(0,0), sigma = matrix(rep(1,4), 2))

You may need to save the script and source it (press Source in the RStudio toolbar)

Break points

We can now browse the function environment in the console, just as we normally browse the global environment. For example, we can look at sigma:

Press Stop to end the browsing. Don’t forget to deactivate the break point by clicking on the red dot in the script.

Advanced tools and concepts

do.call

do.call() takes a function and a list of arguments, and applies the function to the arguments. This can be useful in a variety of situations:


To combine data:

a <- list(c(1, 2, 3), c(4, 5, 6))
do.call(rbind, a)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6


When we have a function name as a string:

my_function <- "mean"
do.call(my_function, list(c(1, 2, 3)))
[1] 2
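The argument list can also be built up programmatically and passed with names. A small sketch using base R's seq():

```r
# build the argument list first, then make the call
args <- list(from = 1, to = 9, by = 2)
do.call(seq, args)
# [1] 1 3 5 7 9
```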

do.call

With dynamic arguments:

For example, we want to allow the user to specify additional arguments for a custom plot function, including xlim. If the user doesn’t specify it, we generate xlim values automatically:

my_plot <- function(x, ...) {
  args <- list(...) # save the additional arguments
  if ("xlim" %in% names(args)) {
    xlim <- args$xlim # if the user provides xlim, use that
    args$xlim <- NULL # to avoid argument duplication
  } else {
    xlim <- range(x) + c(-1, 1) # automatic xlim
  }
  do.call(plot, c(list(x = x, xlim = xlim), args))
}

do.call

par(mfrow = c(1,2))
my_plot(1:3, pch = 17, 
        main = "automatic xlim")
my_plot(1:3, pch = 17, xlim = c(1, 5), 
        main = "user-specified xlim")

Measuring performance

If you have large data sets and complex functions, you may want to enhance their performance.


To check how long an operation takes, use system.time:

system.time(rnorm(10^6))
   user  system elapsed 
   0.03    0.00    0.07 

Measuring performance

The microbenchmark package performs an operation many times, and measures the average time it takes. You can also compare different operations.


What is faster, rnorm or rlnorm?

microbenchmark::microbenchmark(rnorm(10^4),
                               rlnorm(10^4))
Unit: microseconds
         expr   min    lq     mean median      uq    max neval
  rnorm(10^4) 319.9 350.0  414.352 373.80  450.05  875.4   100
 rlnorm(10^4) 840.9 894.6 1035.265 928.65 1154.50 2292.2   100

Profiling

The profvis package lets you identify bottlenecks in your code:

short_pause <- function(x) Sys.sleep(0.1 * x)
long_pause <- function(x) Sys.sleep(0.2 * x)

time_waste <- function(x) {
  short_pause(x)
  long_pause(x)
}
profvis::profvis(time_waste(1))

Speeding up functions

Only spend time trying to make your code faster if

  • it works as intended
  • you have identified the parts that are slowing it down


Here is a good overview on making R functions run faster: Best coding practices in R

Memory allocation

Pre-allocating memory is faster than growing objects repeatedly.

Assume we have recorded the results of 1,000 dice rolls:

data <- sample.int(6, size = 1000, replace = TRUE)

Now, we want to check which ones show “6”. Compare these two approaches:

is_six_1 <- function(x) {
  res <- NULL
  for (i in seq_along(x)) {
    res[i] <- x[i] == 6
  }
  res
}
is_six_2  <- function(x) {
  res <- rep(NA,length(x))
  for (i in seq_along(x)) {
    res[i] <- x[i] == 6
  }
  res
}

Memory allocation

Let’s check which approach is faster:

microbenchmark::microbenchmark(is_six_1(data),
                               is_six_2(data), times = 1000)
Unit: microseconds
           expr min  lq   mean median  uq    max neval
 is_six_1(data) 1.7 2.0 5.5321    2.2 2.4 2685.9  1000
 is_six_2(data) 1.1 1.4 4.9058    1.5 1.6 2971.1  1000


is_six_2() is faster, as R doesn’t have to grow the res object in every iteration.

Vectorisation

R has many functions that are vectorised.

Instead of running a loop to check each value of our dice rolls, we can check them all at once by taking advantage of the ability of == to take vector input:

is_six_2  <- function(x) {
  res <- rep(NA,length(x))
  for (i in seq_along(x)) {
    res[i] <- x[i] == 6
  }
  res
}
is_six_3 <- function(x) {
  x == 6
}

Vectorisation

microbenchmark::microbenchmark(is_six_2(data), 
                               is_six_3(data), times = 1000)
Unit: nanoseconds
           expr  min   lq   mean median   uq     max neval
 is_six_2(data) 1400 1500 8002.3   1600 1700 6117200  1000
 is_six_3(data)  300  400 1802.0    400  500 1290100  1000


The vectorised version is_six_3() is much faster.

Vectorisation

Vectorised matrix functions like rowSums(), colSums() or rowMeans() can lead to impressive speedups:

data <- matrix(rnorm(10^4), nrow = 100)

microbenchmark::microbenchmark(
  apply(data, 2, sum),
  colSums(data),
  times = 1000)
Unit: microseconds
                expr   min    lq     mean median     uq     max neval
 apply(data, 2, sum) 145.1 228.9 419.8747  312.8 449.75 25309.9  1000
       colSums(data)   6.7   9.7  16.5095   12.3  19.60   131.8  1000

Rcpp

The Rcpp package allows for integrating C++ code with R, which can make R functions a lot faster. It requires the installation of a C++ compiler (Rtools on Windows, Xcode on Mac, possibly “sudo apt-get install r-base-dev” on Linux).



I also highly recommend ChatGPT for help with creating C++ functions.

Rcpp

As an example, let’s compare a random walk implemented in R with one implemented with Rcpp.

First, the R version:

random_walk_R <- function(steps) {
  walk <- numeric(steps)
  for (i in 2:steps) {
    walk[i] <- walk[i-1] + ifelse(runif(1) > 0.5, 1, -1)
  }
  return(walk)
}

Rcpp

Next, the Rcpp version:

library(Rcpp)
sourceCpp(
'#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector random_walk_Rcpp(int steps) {
  NumericVector walk(steps);
  for(int i = 1; i < steps; i++) {
    walk[i] = walk[i-1] + (R::runif(0, 1) > 0.5 ? 1 : -1);
  }
  return walk;
}')

We now have the function random_walk_Rcpp in the global environment.

Rcpp

Let’s make sure both versions work:

plot(random_walk_R(1000), type = "l", ylim = c(-50,50))
points(random_walk_Rcpp(1000), type = "l", col = "red")

Rcpp

Now let us compare the speed:

microbenchmark::microbenchmark(
  random_walk_R(1000),
  random_walk_Rcpp(1000),
  times = 100)


Loops are much faster in C++!